In this vignette, we assess the performance of MSqRob for differential expression analysis using a publicly available spike-in study (PRIDE identifier: PXD003881 Shen et al. [2018]). E. Coli lysates were spiked at five different concentrations (3%, 4.5%, 6%, 7.5% and 9% wt/wt) in a stable human background (four replicates per treatment). The samples were run on an Orbitrap Fusion mass spectrometer. Raw data files were processed with MaxQuant (version 1.6.1.0, Cox and Mann [2008]) using default search settings unless otherwise noted. Spectra were searched against the UniProtKB/SwissProt human and E. Coli reference proteome databases (07/06/2018), concatenated with the default Maxquant contaminant database. C arbamidomethylation of Cystein was set as a fixed modification, and oxidation of Methionine and acetylation of the protein amino-terminus were allowed as variable modifications. In silico cleavage was set to use trypsin/P, allowing two miscleavages. Match between runs was also enabled using default settings. The resulting peptide-to-spectrum matches (PSMs) were filtered by MaxQuant at 1% FDR.
We first set the concertations for the different spike-ins.
concentrations <- (2:6) * 1.5
names(concentrations) <- letters[1:5]
We first import the peptides.txt file. This is the file that contains
your peptide-level intensities. For a MaxQuant search [6], this peptides.txt file
can be found by default in the “path_to_raw_files/combined/txt/” folder from
the MaxQuant output, with “path_to_raw_files” the folder where raw files were saved.
In this tutorial, we use a MaxQuant peptides file of the ecoli spike in study
that is stored in the msdata package. We use the QFeatures package to import the data.
With the grepEcols function we find the columns that are
containing the expression data of the peptides in the peptides.txt file.
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(BiocParallel)
myurl<- "https://www.dropbox.com/s/62hhol96pz0sjbi/peptides.zip?dl=0"
download.file(myurl,"peptides.zip", method="curl", extra = "-L")
unzip("peptides.zip")
peptidesFile <- "peptides.txt"
ecols <- MSnbase::grepEcols(peptidesFile, "Intensity ", split = "\t")
pe <- readQFeatures(table = peptidesFile, fnames = 1, ecol = ecols,
name = "peptideRaw", sep="\t")
pe
## An instance of class QFeatures containing 1 assays:
## [1] peptideRaw: SummarizedExperiment with 32827 rows and 20 columns
We can extract the spikein condition from the raw file name.
cond <- which(strsplit(colnames(pe)[[1]][1], split = "")[[1]] == "a") # find where condition is stored
colData(pe)$condition <- substr(colnames(pe), cond, cond) %>% unlist %>% as.factor
We calculate how many non zero intensities we have per peptide. This will be useful for filtering.
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
Peptides with zero intensities are missing peptides and
should be represent with a NA value instead of 0.
pe <- zeroIsNA(pe,"peptideRaw")
In the spik-in study there are peptides from e.coli and human proteins. The ecoli peptides are added at different concerntrations.
myurl <- "https://www.dropbox.com/s/swhu9nqwktythtb/ecoli_up000000625_7_06_2018.fasta?dl=0"
download.file(myurl, "ecoli.fasta", method = "curl", extra = "-L")
myurl <- "https://www.dropbox.com/s/n61n28wrcpwsb4a/human_up000005640_sp_7_06_2018.fasta?dl=0"
download.file(myurl, "human.fasta", method = "curl", extra = "-L")
id <- list(ecoli = 'ecoli.fasta',
human = 'human.fasta') %>%
purrr::map(~{read_lines(.x) %>%
{.[str_detect(.,'^>')]} %>%
str_extract(.,'(?<=\\|).*(?=\\|)')})
We can inspect the missingness in our data with the plotNA() function
provided with MSnbase. 23%
of all peptide intensities are missing and for some peptides we de not
even measure a signal in any sample. The missingness is similar across samples.
MSnbase::plotNA(assay(pe)) +
xlab("Peptide index (ordered by data completeness)")
We normalize the data using vsn normalisation.
Note, that the data should not be log-transformed.
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$Proteins %in%
smallestUniqueGroups(rowData(pe[["peptideRaw"]])$Proteins), ]
We now remove the contaminants, peptides that map to decoy sequences and proteins, which were only identified by peptides with modifications.
pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$Reverse != "+", ]
pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$
Potential.contaminant != "+", ]
We want to keep peptide that were observed at least twice.
pe[["peptideRaw"]] <- pe[["peptideRaw"]][rowData(pe[["peptideRaw"]])$nNonZero >= 2, ]
nrow(pe[["peptideRaw"]])
## [1] 28944
We keep 28944 peptides upon filtering.
pe <- normalize(pe, i = "peptideRaw", method = "vsn", name = "peptideNorm")
## vsn2: 28944 x 20 matrix (1 stratum).
## Please use 'meanSdPlot' to verify the fit.
Upon normalisation the density curves for all samples coincide.
limma::plotDensities(assay(pe[["peptideNorm"]]))
This is more easily seen in a boxplot.
boxplot(assay(pe[["peptideNorm"]]), col = palette()[-1],
main = "Peptide distribtutions after normalisation", ylab = "intensity")
We can visualize our data using a multi-dimensional scaling plot, eg. as provided by the limma package.
limma::plotMDS(assay(pe[["peptideNorm"]]), col = as.numeric(colData(pe)$condition))
The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences (log FC) in the peptide data seems to be driven by the spike-in condition.
We use the standard sumarisation in aggregateFeatures, which is a robust summarisation method.
pe <- aggregateFeatures(pe, i = "peptideNorm", fcol = "Proteins", na.rm = TRUE,
name = "protein")
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
plotMDS(assay(pe[["protein"]]), col = as.numeric(colData(pe)$condition))
We model the protein level expression values using msqrob. By default msqrob2 estimates the model parameters using robust regression.
pe <- msqrob(object = pe, i = "protein", formula = ~condition)
What are the parameter names of the model?
getCoef(rowData(pe[["protein"]])$msqrobModels[[1]])
## (Intercept) conditionb conditionc conditiond conditione
## 23.44881438 0.08851106 0.03863228 0.01051063 -0.07124313
Spike-in condition a is the reference class. So the mean log2 expression for
samples from condition a is (Intercept).
The mean log2 expression for samples from condition b-e is
‘(Intercept)+conditionb’,…,‘(Intercept)+conditione’, respectively.
Hence, the average log2 fold change (FC) between condition b and condition a
is modelled using the parameter ‘conditionb’. Thus, we assess the contrast
‘conditionb = 0’ with our statistical test. The same holds for comparison c-a, d-a, e-a.
comparisonsRef <- paste0(paste0("condition", letters[2:5]), " = 0")
comparisonsRef
## [1] "conditionb = 0" "conditionc = 0" "conditiond = 0" "conditione = 0"
The test for average log2 FC between condition c and b will assess the contrast ‘conditionc - conditionb = 0’, …
comparisonsOther <- paste0(
apply(
combn(paste0("condition", letters[2:5]), 2)[2:1, ],
2,
paste,
collapse = " - ")
, " = 0")
comparisonsOther
## [1] "conditionc - conditionb = 0" "conditiond - conditionb = 0"
## [3] "conditione - conditionb = 0" "conditiond - conditionc = 0"
## [5] "conditione - conditionc = 0" "conditione - conditiond = 0"
comparisons <- c(comparisonsRef, comparisonsOther)
We make the contrast matrix using the makeContrast function
L <- makeContrast(comparisons, parameterNames = paste0("condition", letters[2:5]))
L
## conditionb conditionc conditiond conditione conditionc - conditionb
## conditionb 1 0 0 0 -1
## conditionc 0 1 0 0 1
## conditiond 0 0 1 0 0
## conditione 0 0 0 1 0
## conditiond - conditionb conditione - conditionb
## conditionb -1 -1
## conditionc 0 0
## conditiond 1 0
## conditione 0 1
## conditiond - conditionc conditione - conditionc
## conditionb 0 0
## conditionc -1 -1
## conditiond 1 0
## conditione 0 1
## conditione - conditiond
## conditionb 0
## conditionc 0
## conditiond -1
## conditione 1
And we adopt the hypothesis tests for each contrast.
pe <- hypothesisTest(object = pe, i = "protein", contrast = L)
Here, we show the 6 most DE proteins for the comparison b-a.
rowData(pe[["protein"]]) %>%
.$"conditionb" %>%
rownames_to_column("protein") %>%
arrange(pval) %>%
column_to_rownames("protein") %>%
head
## logFC se df t pval adjPval
## P0A799 0.6536363 0.03617586 16.82604 18.06830 1.878796e-12 7.186060e-09
## P0A853 0.5943575 0.03263387 16.19385 18.21290 3.264165e-12 7.186060e-09
## P0A7J3 0.6367692 0.03710347 16.49710 17.16199 5.984403e-12 8.783109e-09
## P0A7S9 0.6940529 0.04175391 15.93352 16.62247 1.735147e-11 1.909963e-08
## P0A6M8 0.6141933 0.03858488 16.18665 15.91798 2.609115e-11 2.297587e-08
## P0C0L2 0.6438649 0.04024479 15.57908 15.99871 4.374996e-11 3.210518e-08
We do the same for comparison c-b.
rowData(pe[["protein"]]) %>%
.$"conditionc - conditionb" %>%
rownames_to_column("protein") %>%
arrange(pval) %>%
column_to_rownames("protein") %>%
head
## logFC se df t pval adjPval
## P0A799 0.4477370 0.03653723 16.82604 12.25427 8.272065e-10 1.705967e-06
## P60723 0.4860439 0.03805471 15.96275 12.77224 8.537166e-10 1.705967e-06
## P0A6F3 0.4742037 0.03685706 15.43962 12.86602 1.162367e-09 1.705967e-06
## P0A853 0.3906772 0.03334316 16.19385 11.71686 2.533679e-09 2.788947e-06
## P0A6M8 0.4407012 0.03858488 16.18665 11.42160 3.690290e-09 3.172843e-06
## P0A7J3 0.4162573 0.03737184 16.49710 11.13826 4.323656e-09 3.172843e-06
volcano <- rowData(pe[["protein"]]) %>%
.$"conditionb"%>%
ggplot(aes(x = logFC, y = -log10(pval),
color = adjPval < 0.01)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal() +
geom_vline(xintercept = log2(concentrations["b"]/concentrations["a"]), col = "red")
volcano
Note, that the majority of the significant DE proteins are upregulated. In the spike-in study all ecoli proteins are indeed present at a higher concentration in condition b than in condition a. We also observe that the DE proteins are indeed close to the spiked in fold change of 0.58. The log2 FC of the spike-in proteins is indicated with the vertical red line in the plot.
We first select the names of the significant proteins.
sigNames <- rowData(pe[["protein"]]) %>%
.$"conditionb" %>%
rownames_to_column("protein") %>%
filter(adjPval<0.01) %>%
pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ])
Indeed, the majority of the DE proteins at 1% FDR seem to be spiked.
####Detail plots
We first extract the normalized peptide expression values for a particular protein.
for (protName in sigNames[1:5]) {
pePlot <- pe[protName, , c("peptideNorm", "protein")]
pePlotDf <- data.frame(longFormat(pePlot))
pePlotDf$assay <- factor(pePlotDf$assay,
levels = c("peptideNorm", "protein"))
pePlotDf$condition <- as.factor(colData(pePlot)[pePlotDf$colname, "condition"])
p1 <- ggplot(data = pePlotDf,
aes(x = colname,
y = value,
group = rowname)) +
geom_line() + geom_point() + theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_grid(~ assay) + ggtitle(protName)
print(p1)
p2 <- ggplot(pePlotDf, aes(x = colname, y = value, fill = condition)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = position_jitter(width = .1), aes(shape = rowname)) +
scale_shape_manual(values = 1:nrow(pePlotDf)) +
labs(title = protName, x = "sample", y = "Peptide intensity (log2)") +
theme_minimal() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_grid(~assay)
print(p2)
}
Because we are analysing a spike-in study we know the ground truth, i.e. we know that only the spike-in proteins (ecoli) are differentially expressed. We can therefore evaluate the performance of the method, i.e. we will assess
the sensitivity or true positive rate (TPR), the proportion of actual positives that are correctly identified, in the protein list that we return \[TPR=\frac{TP}{\text{#actual positives}},\] here TP are the true positives in the list. The TPR is thus the fraction of ups proteins that we can recall.
false discovery proportion (FPD): fraction of false positives in the protein list that we return: \[FPD=\frac{FP}{FP+TP},\] with FP the false positives. In our case the yeast proteins that are in our list.
Instead of only calculating these metric for the protein list that is returned for the chosen FDR level, we can do this for all possible FDR cutoffs so that we get an overview of the quality of the ranking of the proteins in the protein list.
We first add the ground truth data to the rowData of the object.
accessions <- rownames(pe[["protein"]]) %>%
data_frame(protein=.)
accessions <- accessions %>%
transmute(protein=as.character(protein),proteins = strsplit(protein, ';')) %>%
unnest %>%
mutate(human = proteins %in% id$human, ecoli = proteins %in% id$ecoli) %>%
group_by(protein) %>%
summarise(human = any(human), ecoli = any(ecoli)) %>%
right_join(accessions)
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(proteins)`
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "protein"
rowData(pe[["protein"]])$accession <- accessions
Check that all accessions are either human or ecoli:
nrow(accessions)
## [1] 4710
sum(accessions$human)
## [1] 3954
sum(accessions$ecoli)
## [1] 756
sum(accessions$human) + sum(accessions$ecoli)
## [1] 4710
Function to calculate TPR and FDP
tprFdp <- function(pval, tp, adjPval){
ord <- order(pval)
return(data.frame(
pval = pval[ord],
adjPval = adjPval[ord],
tpr = cumsum(tp[ord])/sum(tp),
fdp = cumsum(!tp[ord])/1:length(tp)))
}
Performance Plot
tprFdpCompBa <- tprFdp(rowData(pe[["protein"]])$"conditionb"$pval,
rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])$"conditionb"$adjPval)
tprFdpPlotBa <- tprFdpCompBa %>%
ggplot(aes(x = fdp, y = tpr)) +
geom_path() +
geom_vline(xintercept=0.01,lty=2) +
geom_point(data = tprFdpCompBa[sum(tprFdpCompBa$adjPval < 0.01, na.rm = TRUE), ],
aes(x = fdp, y = tpr), cex = 2)
tprFdpPlotBa
tprFdpPlotBa + xlim(0,.25)
## Warning: Removed 4038 row(s) containing missing values (geom_path).
The toplist that is returned at the 1% FDR level is indeed close to 1% FDP, indicating that the FDR level is well controlled. (see dot on performance plot).
contrastNames <- colnames(L)
FCs <- apply(
combn(letters[1:5], 2),
2,
function(x) concentrations[x[2]]/concentrations[x[1]]
)
names(FCs) <- contrastNames
volcano <- list()
for (i in contrastNames)
{
volcano[[i]] <- rowData(pe[["protein"]])[[i]] %>%
ggplot(aes(x = logFC, y = -log10(pval),
color = adjPval < 0.01)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
geom_vline(xintercept=log2(FCs[i]),col="red") +
ggtitle(i)
}
volcano
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
for (i in contrastNames)
{
sigNames <- rowData(pe[["protein"]])[[i]] %>%
rownames_to_column("protein") %>% filter(adjPval < 0.01) %>% pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ], main = i)
}
We again observe that the majority of the proteins that are returned in all comparisons are spike-in proteins.
tprFdps<-list()
tprFdpPlots <- list()
for (i in contrastNames)
{
tprFdps[[i]] <- tprFdp(rowData(pe[["protein"]])[[i]]$pval,
rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])[[i]]$adjPval)
tprFdpPlots[[i]] <- tprFdps[[i]] %>%
ggplot(aes(x = fdp, y = tpr)) +
geom_path() +
geom_vline(xintercept = 0.01, lty = 2) +
geom_point(data = tprFdps[[i]][sum(tprFdps[[i]]$adjPval < 0.01, na.rm=TRUE), ], aes(x = fdp, y = tpr), cex = 2) +
ggtitle(i)
}
tprFdpPlots
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
The performance gets better when the logFC between the compared spike-in conditions increase.
Note, that the FDP is not close to the chosen FDR of 0.01 for the comparisons involving condition e. This is probably due to the competition effects because of spiking the ecoli proteins in at a relative high concentration.
pe <- aggregateFeatures(pe, "peptideNorm", fcol = "Proteins", na.rm = TRUE,
name = "proteinMedian", fun = matrixStats::colMedians)
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
pe <- msqrob(object = pe, i = "proteinMedian", formula = ~condition)
pe <- hypothesisTest(object = pe, i = "proteinMedian", contrast = L)
volcanoMed<-list()
for (i in contrastNames)
{
volcanoMed[[i]] <- rowData(pe[["proteinMedian"]])[[i]] %>%
ggplot(aes(x = logFC, y = -log10(pval),
color = adjPval < 0.01)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
geom_vline(xintercept = log2(FCs[i]), col = "red") +
ggtitle(paste("median summarisation",i))
}
volcanoMed
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
Note, that less proteins are found to be DE upon median summarisation.
Add accession slot to rowData of proteinMedian assay. (First check if same proteins are in both the protein and proteinMedian assay).
mean(rownames(pe[["proteinMedian"]]) == rownames(pe[["protein"]]))
## [1] 1
rowData(pe[["proteinMedian"]])$accession <- rowData(pe[["protein"]])$accession
tprFdpMed<-list()
tprFdpPlotMed <- list()
for (i in contrastNames)
{
tprFdpMed[[i]] <- tprFdp(rowData(pe[["proteinMedian"]])[[i]]$pval,
rowData(pe[["proteinMedian"]])$accession$ecoli,
rowData(pe[["proteinMedian"]])[[i]]$adjPval)
hlp <- rbind(cbind(tprFdps[[i]], method = "robustRlm"),
cbind(tprFdpMed[[i]], method = "medianRlm"))
tprFdpPlotMed[[i]] <- hlp %>%
ggplot(aes(x = fdp, y = tpr, color = method)) +
geom_path() +
ggtitle(i)
}
tprFdpPlotMed
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
The mean summarisation is vastly outperformed by robust summarisaton. This clearly indicates that differences in peptide species have to be accounted for in the summarisation.
msqrob2 can also be used to adopt parameter estimation using robust
ridge regression by setting the argument ridge = TRUE. The performance of ridge
regression generally improves for more complex designs with multiple conditions.
Note, that the parameter names for ridge regression always start with the string “ridge”. So we have to adjust the contrast matrix.
pe <- msqrob(object = pe, i = "protein", formula = ~condition,
modelColumnName = "ridge", ridge = TRUE)
Lridge <- L
rownames(Lridge) <- paste0("ridge", rownames(L))
pe <- hypothesisTest(object = pe, i = "protein", contrast = Lridge,
modelColumn = "ridge", resultsColumnNamePrefix = "ridge")
## Warning in sqrt(sapply(models, varContrast, L = contrast)): NaNs produced
volcanoRidge<-list()
for (i in contrastNames)
{
volcanoRidge[[i]] <- rowData(pe[["protein"]])[[paste0("ridge",i)]] %>%
ggplot(aes(x = logFC, y = -log10(pval), color = adjPval < 0.01)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
geom_vline(xintercept=log2(FCs[i]),col="red") +
ggtitle(paste("robust ridge",i))
}
volcanoRidge
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
tprFdpRidge <- list()
tprFdpPlotRidge <- list()
for (i in contrastNames)
{
tprFdpRidge[[i]] <- tprFdp(rowData(pe[["protein"]])[[paste0("ridge",i)]]$pval,
rowData(pe[["protein"]])$accession$ecoli, rowData(pe[["protein"]])[[paste0("ridge", i)]]$adjPval)
hlp <- rbind(cbind(tprFdps[[i]], method = "robustRlm"),
cbind(tprFdpMed[[i]], method = "medianRlm"),
cbind(tprFdpRidge[[i]], method = "robustRidge"))
tprFdpPlotRidge[[i]] <- hlp %>%
ggplot(aes(x = fdp, y = tpr, color = method)) +
geom_path() +
ggtitle(i)
}
tprFdpPlotRidge
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
Ridge regression further improves the performance for allmost all comparisons.
For some proteins the models cannot be fitted because the design matrix is not full rank as a result of missingness. In our one-way anova design this happens because all protein expression values for one or more conditions are missing. Note, that we have to be very careful with the inference for these proteins because the reference level of factors can even change. Indeed for some proteins, especially spike-in proteins, the protein expression values for the lowest spike-in condition a are missing.
fitErrors <- which(sapply(rowData(pe[["protein"]])$msqrobModels,
getFitMethod) == "fitError")
yFitErrors <- assay(pe[["protein"]])[fitErrors, ]
There are 307 proteins with fit errors.
We first calculate the prior degrees of freedom and the prior variance so as to calculate the posterior variances.
vars <- sapply(rowData(pe[["protein"]])$msqrobModels, getVar)
dfs <- sapply(rowData(pe[["protein"]])$msqrobModels, getDF)
priorEst <- limma::fitFDist(vars, dfs)
Next, we estimate the models by dropping the levels of the factor for which all observations are missing.
modelsFitErrors <- apply(yFitErrors, 1, function(y, data, formula){
# Remove information for which we have missing values
sel <- !is.na(y)
y <- y[sel]
y <- matrix(y, nrow = 1)
data <- subset(data, sel)
# Drop unused levels of factors of the data
factorColumns <- which(sapply(data, class) == "factor")
for (j in factorColumns)
data[[j]] <- droplevels(data[[j]])
# Fit the model and adopt Empirical Bayes variance estimation
out <- try(msqrobLm(y, formula, data)[[1]], silent=TRUE)
if (class(out)=="try-error") {
out <- StatModel(type = "fitError",
params = list(coefficients = NA,
vcovUnscaled = NA,
sigma = NA, df.residual = NA, w = NULL),
varPosterior = as.numeric(NA),
dfPosterior = as.numeric(NA))
} else {
slot(out,"varPosterior") <- limma:::.squeezeVar(getVar(out), getDF(out),
priorEst$scale, priorEst$df2)
slot(out,"dfPosterior") <- getDF(out) + priorEst$df2
}
return(out)
}, formula = ~condition, data = colData(pe)
)
Calculate reference levels for proteins with fit errors
refLevels <- apply(yFitErrors, 1, function(y, data){
levels(droplevels(data$condition[!is.na(y)]))[1]
}, data = colData(pe))
Parameters for some conditions are missing. We therefore remove the parameters from the contrast for which the contrast equals zero to enable inference for models that contain all model parameters involving a particular contrast.
inferenceErrorRefLevelA <- list()
for (j in 1:ncol(L)) {
contrast <- L[, j]
contrast[contrast == 0] <- NA
inferenceErrorRefLevelA[[colnames(L)[j]]] <-
topFeatures(modelsFitErrors[refLevels == "a"], na.exclude(contrast), sort = FALSE)
}
For these proteins the reference class has changed. We first set inference for all contrasts involving the reference level “a” equal to NA. We do this by setting all coefficients of the contrast equal to NA.
inferenceErrorRefLevelAltered <- list()
for (j in 1:4) {
contrast <- L[, j]
contrast[] <- NA
inferenceErrorRefLevelAltered[[colnames(L)[j]]] <- topFeatures(modelsFitErrors[refLevels!="a"],
na.exclude(contrast), sort = FALSE)
}
Next, we perform inference for the remaining contrasts and correct for the change in reference level.
protNamesHlp <- names(modelsFitErrors[refLevels != "a"])
for (j in 5:ncol(L)) {
inferenceErrorRefLevelAltered[[colnames(L)[j]]] <-
sapply(protNamesHlp, function(i, models, contrast, refLevels){
contrast[contrast == 0] <- NA
if (refLevels[i] != "a")
contrast[paste0("condition",refLevels[i])] <- NA
topFeatures(models[i], na.exclude(contrast)) %>% unlist
},
contrast = L[, j],
refLevels = refLevels, models = modelsFitErrors[refLevels != "a"]) %>%
t %>%
data.frame
}
Note, that we have to reperform the FDR correction. Indeed, the FDR is a set property and has to be calculated using all results (default msqrob2 + custom msqrob2 models) for each contrast. We store the results in the rowData of the assay in columns starting with “rlmOpt” followed by the name of the contrast.
for (j in colnames(L)) {
# Combining the results
featureTableHlp <-rowData(pe[["protein"]])[[j]]
featureTableHlp[rownames(inferenceErrorRefLevelA[[j]]), ] <- inferenceErrorRefLevelA[[j]]
featureTableHlp[rownames(inferenceErrorRefLevelAltered[[j]]), ] <- inferenceErrorRefLevelAltered[[j]]
featureTableHlp$adjPval <- p.adjust(featureTableHlp$pval, "BH")
# Store results in rowData of the protein assay
rowData(pe[["protein"]])[[paste0("rlmOpt", j)]] <- featureTableHlp
}
nProtDefault <- sapply(colnames(L),
function(j){
sum(!is.na(rowData(pe[["protein"]])[[j]]$pval))
})
nProtCustom <- sapply(colnames(L),
function(j){
sum(!is.na(rowData(pe[["protein"]])[[paste0("rlmOpt", j)]]$pval))
})
cbind(default = nProtDefault, custom = nProtCustom,
extra = nProtCustom - nProtDefault)
## default custom extra
## conditionb 4403 4453 50
## conditionc 4403 4444 41
## conditiond 4403 4450 47
## conditione 4403 4447 44
## conditionc - conditionb 4403 4502 99
## conditiond - conditionb 4403 4510 107
## conditione - conditionb 4403 4505 102
## conditiond - conditionc 4403 4518 115
## conditione - conditionc 4403 4514 111
## conditione - conditiond 4403 4526 123
We observe that we can retrieve additional results for 41 to 123 proteins.
tprFdpRlmOpt <- tprFdpRlmOptPlot <- list()
for (i in contrastNames) {
tprFdpRlmOpt[[i]] <- tprFdp(rowData(pe[["protein"]])[[paste0("rlmOpt",i)]]$pval %>% unlist,
rowData(pe[["protein"]])$accession$ecoli,rowData(pe[["protein"]])[[paste0("rlmOpt",i)]]$adjPval %>% unlist)
hlp <- rbind(cbind(tprFdpRlmOpt[[i]], method = "error"),
cbind(tprFdps[[i]], method = "noError"))
tprFdpRlmOptPlot[[i]] <- hlp %>%
ggplot(aes(x = fdp, y = tpr, color = method)) +
geom_path() +
geom_vline(xintercept = 0.01, lty = 2)
ggtitle(i)
}
tprFdpRlmOptPlot
## $conditionb
##
## $conditionc
##
## $conditiond
##
## $conditione
##
## $`conditionc - conditionb`
##
## $`conditiond - conditionb`
##
## $`conditione - conditionb`
##
## $`conditiond - conditionc`
##
## $`conditione - conditionc`
##
## $`conditione - conditiond`
As expected, the ranking does not improve much for the comparisons involving the low spike-in concentration because the ecoli proteins are mainly missing for these conditions.